Triplet deep hashing method for speech retrieval
Qiuyu ZHANG, Yongwang WEN
Journal of Computer Applications    2023, 43 (9): 2910-2918.   DOI: 10.11772/j.issn.1001-9081.2022081149

The existing deep hashing methods for content-based speech retrieval make insufficient use of supervised information, so the generated hash codes are suboptimal and both retrieval precision and retrieval efficiency are low. To address these problems, a triplet deep hashing method for speech retrieval was proposed. Firstly, spectrogram image features were fed to the model in triplets to extract the effective information of the speech features. Then, an Attentional mechanism-Residual Network (ARN) model was proposed; that is, a spatial attention mechanism was embedded on the basis of the Residual Network (ResNet), and the salient-region representation was improved by aggregating the energy-salient region information across the whole spectrogram. Finally, a novel triplet cross-entropy loss was introduced to map the classification information and the similarity between spectrogram image features into the learned hash codes, thereby achieving maximal class separability and maximal hash code discriminability during model training. Experimental results show that the efficient and compact binary hash codes generated by the proposed method achieve recall, precision and F1 score above 98.5% in speech retrieval. Compared with methods such as the single-label retrieval method, the average running time of the proposed method using Log-Mel spectra as features is shortened by 19.0% to 55.5%. Therefore, the proposed method can significantly improve retrieval efficiency and retrieval precision while reducing the amount of computation.
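The abstract does not give the exact structure of the ARN model or of the triplet cross-entropy loss, so the following is only a minimal PyTorch sketch of the general idea: a CBAM-style spatial attention module that re-weights salient regions of the feature map, and a loss combining a softmax cross-entropy term with a triplet margin term on the hash embeddings. The module names, the attention design, and the weighting factor alpha are illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention (an assumption; the paper's exact
    attention design is not given in the abstract): each spatial position
    is re-weighted by pooled channel statistics."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)      # average over channels
        mx, _ = x.max(dim=1, keepdim=True)     # max over channels
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn                        # emphasize salient regions

class TripletCrossEntropyLoss(nn.Module):
    """Illustrative combination of a triplet margin term on the hash
    embeddings with a classification cross-entropy term."""
    def __init__(self, margin=1.0, alpha=0.5):
        super().__init__()
        self.triplet = nn.TripletMarginLoss(margin=margin)
        self.alpha = alpha  # hypothetical trade-off weight

    def forward(self, anchor, positive, negative, logits, labels):
        return (self.triplet(anchor, positive, negative)
                + self.alpha * F.cross_entropy(logits, labels))

# Hypothetical usage on one triplet batch of 48-bit hash embeddings:
loss_fn = TripletCrossEntropyLoss()
emb_a, emb_p, emb_n = (torch.randn(8, 48) for _ in range(3))
loss = loss_fn(emb_a, emb_p, emb_n,
               torch.randn(8, 10), torch.randint(0, 10, (8,)))
```

At retrieval time, the continuous embeddings would typically be binarized, for example with torch.sign, to obtain the compact binary hash codes the abstract describes.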

Speech classification model based on improved Inception network
Qiuyu ZHANG, Yukun WANG
Journal of Computer Applications    2023, 43 (3): 909-915.   DOI: 10.11772/j.issn.1001-9081.2022010047

To address the complicated audio feature extraction process of traditional audio classification models and the problems of existing neural network models such as overfitting, low classification accuracy, and vanishing gradients, a speech classification model based on an improved Inception network was proposed. Firstly, to avoid vanishing gradients while increasing network depth, the residual skip connection idea of the Residual Network (ResNet) was added to the model to improve the traditional Inception V2 model. Secondly, the convolution kernel sizes in the Inception module were optimized, and the deep features of the Log-Mel spectrogram of the original speech were extracted with convolutions of different sizes, so that the model could learn to select the appropriate convolution to process the data; at the same time, the model was deepened and widened to increase classification accuracy. Finally, the trained network model was used to classify and predict the speech data, and the classification result was obtained through the Softmax function. Experimental results on the Tsinghua University Chinese speech database THCHS-30 and the environmental sound dataset UrbanSound8K show that the classification accuracy of the improved Inception network model on these two datasets is 92.76% and 93.34% respectively. Compared with models such as the Visual Geometry Group network (VGG16), Inception V2 and GoogLeNet, the proposed model achieves the best classification accuracy, with a maximum improvement of 27.30 percentage points. It can be seen that the proposed model has stronger feature fusion ability and more accurate classification results, and can alleviate problems such as overfitting and vanishing gradients.
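As a rough illustration of the described architecture, the following PyTorch sketch combines an Inception-style block, whose parallel branches use convolution kernels of different sizes, with a ResNet-style skip connection. The kernel sizes, branch widths, and classification head are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ResidualInception(nn.Module):
    """Inception-style block with parallel convolutions of different
    kernel sizes plus a residual skip connection (branch layout is an
    illustrative assumption)."""
    def __init__(self, in_ch, branch_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, branch_ch, 1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, branch_ch, 1),
                                nn.Conv2d(branch_ch, branch_ch, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, branch_ch, 1),
                                nn.Conv2d(branch_ch, branch_ch, 5, padding=2))
        self.proj = nn.Conv2d(3 * branch_ch, in_ch, 1)  # match channels for the skip
        self.bn = nn.BatchNorm2d(in_ch)

    def forward(self, x):
        out = torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)
        return torch.relu(self.bn(self.proj(out)) + x)  # residual skip connection

# Sketch of the pipeline on a Log-Mel spectrogram input: convolutional
# stem, one residual Inception block, pooling, then Softmax probabilities.
model = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1),
    ResidualInception(32, 16),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, 10),   # 10 classes, e.g. for UrbanSound8K
)
probs = torch.softmax(model(torch.randn(4, 1, 64, 128)), dim=1)
```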
